{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Smoothing EBMs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this demonstration notebook, we are going to create an Explainable Boosting Machine (EBM) using a specially designed synthetic dataset. Our control over the data generation process allows us to visually assess how well the EBM is able to recover the original functions that were used to create the data. To understand how the synthetic dataset was generated, you can examine the full code on GitHub. This will provide insights into the underlying functions we are trying to recover. The full dataset generation code can be found in: [**_synthetic generation code_**](https://github.com/interpretml/interpret/blob/develop/python/interpret-core/interpret/utils/_synthetic.py)\n",
    "\n",
    "This notebook can be found in our [**_examples folder_**](https://github.com/interpretml/interpret/tree/develop/docs/interpret/python/examples) on GitHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install interpret if not already installed\n",
    "try:\n",
    "    import interpret\n",
    "except ModuleNotFoundError:\n",
    "    !pip install --quiet interpret scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# boilerplate - generate the synthetic dataset and split into test/train\n",
    "\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from interpret.utils import make_synthetic\n",
    "from interpret import show\n",
    "\n",
    "from interpret import set_visualize_provider\n",
    "from interpret.provider import InlineProvider\n",
    "set_visualize_provider(InlineProvider())\n",
    "\n",
    "seed = 42\n",
    "\n",
    "X, y, names, types = make_synthetic(classes=None, n_samples=50000, missing=False, seed=seed)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Train the Explainable Boosting Machine (EBM)</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The synthetic dataset has a significant number of smooth functions. To handle these smoothly varying relationships effectively, we incorporate a parameter called 'smoothing_rounds' in the EBM fitting process. 'smoothing_rounds' initiates the boosting process in a non-greedy manner by selecting random split points when constructing the internal decision trees. This strategy helps to avoid initial overfitting and sets up baseline smooth partial responses before changing to using a greedy approach that is better for fitting any remaining sharp transitions in the partial responses. We also use the reg_alpha regularization parameter to further smooth the results. EBMs additionally support reg_lambda and max_delta_step, which might be useful in some cases.\n",
    "\n",
    "For some datasets with large outliers, increasing the validation_size and/or taking the median model from the outer bags might be helpful as described here:\n",
    "https://github.com/interpretml/interpret/issues/548"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from interpret.glassbox import ExplainableBoostingRegressor\n",
    "\n",
    "ebm = ExplainableBoostingRegressor(names, types, interactions=3, smoothing_rounds=5000, reg_alpha=10.0)\n",
    "ebm.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Global Explanations</h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The visual graphs below confirm that the EBM was able to successfully recover the original data generation functions for this particular problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 0 - Cosine partial response generated on uniformly distributed data.\n",
    "\n",
    "show(ebm.explain_global(), 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 1 - Sine partial response generated on normally distributed data.\n",
    "\n",
    "show(ebm.explain_global(), 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 2 - Squared partial response generated on exponentially distributed data.\n",
    "\n",
    "show(ebm.explain_global(), 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 3 - Linear partial response generated on poisson distributed integers.\n",
    "\n",
    "show(ebm.explain_global(), 3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 4 - Square wave partial response generated on a feature with correlations\n",
    "#             to features 0 and 1 with added normally distributed noise.\n",
    "\n",
    "show(ebm.explain_global(), 4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 5 - Sawtooth wave partial response generated on a feature with a conditional \n",
    "#             correlation to feature 2 with added normally distributed noise.\n",
    "\n",
    "show(ebm.explain_global(), 5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 6 - exp(x) partial response generated on a feature with interaction effects \n",
    "#             between features 2 and 3 with added normally distributed noise.\n",
    "\n",
    "show(ebm.explain_global(), 6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 7 - Unused in the generation function. Should have minimal importance.\n",
    "\n",
    "show(ebm.explain_global(), 7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 8 - Linear partial response generated on a low cardinality categorical feature.\n",
    "#             The category strings end in integers that indicate the increasing order.\n",
    "\n",
    "show(ebm.explain_global(), 8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Feature 9 - Linear partial response generated on a high cardinality categorical feature.\n",
    "#             The category strings end in integers that indicate the increasing order.\n",
    "\n",
    "show(ebm.explain_global(), 9)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interaction 0 - Pairwise interaction generated by XORing the sign of feature 0 \n",
    "#                 with the least significant bit of the integers from feature 3.\n",
    "\n",
    "show(ebm.explain_global(), 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interaction 1 - Pairwise interaction generated by multiplying feature 1 and 2.\n",
    "\n",
    "show(ebm.explain_global(), 11)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interaction 2 - Pairwise interaction generated by multiplying feature 3 and 8.\n",
    "\n",
    "show(ebm.explain_global(), 12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>For RMSE regression, the EBM's intercept should be identical to the mean</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(np.average(y_train))\n",
    "print(ebm.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Importances of the features and pairwise terms</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "show(ebm.explain_global())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Evaluate EBM performance</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from interpret.perf import RegressionPerf\n",
    "\n",
    "ebm_perf = RegressionPerf(ebm).explain_perf(X_test, y_test, name='EBM')\n",
    "print(\"RMSE: \" + str(ebm_perf._internal_obj[\"overall\"][\"rmse\"]))\n",
    "show(ebm_perf)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
